chore(indexers): 80 rename vectorstore id column to label#144
frayle-ons wants to merge 5 commits into main
lukeroantreeONS left a comment
This is done well, and works.
This update is a good point to get the servers module naming conventions in sync with the indexers module though, so I've requested a few further changes.
Can you update the columns in the response objects in the servers module to have a 1-1 match with the column names in the indexers module?
See differences for same input query here (from search, but to be reflected in embed/reverse_search too);
This would mean:
- rename 'input_label' -> 'query_label'
- add 'query_text' field at same level as 'query_label'
- rename 'label' -> 'doc_label'
- rename 'description' -> 'doc_text'
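As a rough sketch of the renames listed above, using a flat dictionary rather than the actual pydantic models (which nest these fields): the function name `to_indexer_names`, the `query_text` argument, and the sample values are illustrative assumptions, not code from this PR.

```python
# Hypothetical mapping from the current servers-module field names to the
# indexers-module column names requested in the review comment.
FIELD_RENAMES = {
    "input_label": "query_label",
    "label": "doc_label",
    "description": "doc_text",
}


def to_indexer_names(entry, query_text=None):
    """Return a copy of `entry` with old field names replaced by new ones.

    `query_text` is added alongside `query_label` when it is provided.
    """
    renamed = {FIELD_RENAMES.get(key, key): value for key, value in entry.items()}
    if query_text is not None and "query_label" in renamed:
        renamed["query_text"] = query_text
    return renamed


# Made-up sample values, for illustration only.
old_entry = {"input_label": "q1", "label": "A01", "description": "Growing of cereals"}
new_entry = to_indexer_names(old_entry, query_text="wheat farming")
```

Keys that are not in the mapping pass through unchanged, so the same helper could be applied to the embed and reverse search responses if they share these field names.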
Refactored the pydantic models for the server in the latest commit.
"""Atomic model for a single row of input data (i.e. a single query input) , includes 'id' and
'description' which are expected as str type.
class SearchRequestEntry(BaseModel):
"""Atomic model for a single row of VectorStore search method input data (i.e. a single query input) , includes 'id' and
Note: use backticks around `VectorStore` (and other object types) in docstrings so that they render correctly in Quarto.
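For illustration, a docstring following this convention might look like the sketch below. The class here is a plain-Python stand-in (the real model subclasses pydantic's `BaseModel`), and the docstring wording is an assumption rather than the text from the PR.

```python
class SearchRequestEntry:
    """Atomic model for a single row of `VectorStore` search method input data.

    Object names such as `VectorStore` and `SearchRequestEntry` are wrapped in
    backticks so that a Markdown-based renderer can show them as inline code.
    """


# The docstring is available at runtime, so the convention can be checked.
docstring = SearchRequestEntry.__doc__
```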
class ResultEntry(BaseModel):
"""Atomic model for a single row of vector store result data (i.e. a single vectorstore entry),
'vector store' / 'vectorstore' -> 'VectorStore'
"""Model for a list of many ResultEntry pydantic models, representing a ranked list of vector
store search results.
class SearchResponseSet(BaseModel):
"""Model for a list of many SearchResponseEntry pydantic models, representing a ranked list of vector
Backticks, as noted with `VectorStore`, but for all object types.
ReverseSearchResponseSet(
input_id=input_id,
response=response_entries,
doc_label=group_df["doc_label"].iloc[0], # Assuming `doc_label` is the same for all rows in the group
Does this work in the case of partial-matching?
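One way to probe this, sketched in plain Python rather than the pandas group-by used in the diff: if partial matching lets a single input match several distinct `doc_label` values (an assumption about how partial matching behaves here), then taking the first row's label would hide the others. The rows and the helper `labels_per_input` are hypothetical.

```python
from collections import defaultdict

# Made-up reverse search rows: input "q1" matches two different labels,
# as might happen if labels are matched by prefix.
rows = [
    {"input_id": "q1", "doc_label": "0111"},
    {"input_id": "q1", "doc_label": "0112"},
    {"input_id": "q2", "doc_label": "0200"},
]


def labels_per_input(rows):
    """Collect the set of distinct doc_label values seen for each input_id."""
    grouped = defaultdict(set)
    for row in rows:
        grouped[row["input_id"]].add(row["doc_label"])
    return dict(grouped)


labels = labels_per_input(rows)

# Inputs for which "take the first doc_label" would drop information.
ambiguous = sorted(key for key, found in labels.items() if len(found) > 1)
```

If `ambiguous` is non-empty for real partial-match results, the assumption in the inline comment does not hold for those groups.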
✨ Summary
These suggested changes update the naming conventions of the `VectorStore` class. Previously, `VectorStore`s contained row entries with values for `['id', 'text', 'embedding']` (as well as a UUID column); the `id` column is now named `label`.

This was proposed for semantic reasons: for most use cases of ClassifAI, a label for each entry in a `VectorStore` is easier to understand as the relevance/classification label associated with that entry than a row id, which can be confused with the UUID column.

Corresponding to this change in the `VectorStore` and the vectors.parquet file, the dataclasses have also been updated to refer to the new 'label' name. For example, the `VectorStoreSearchResult` dataclass previously had a column `doc_id`, which has now been replaced by `doc_label`. Several other dataclasses have been updated as well, and this is reflected in new `VectorStore` and `Server` code logic that processes different operations when using the `VectorStore`.

Note: I updated the dataclasses and `VectorStore` logic for this PR, and then made changes to the Server module, because the data it converts from `VectorStore` logic into an API response has changed. I have not fully reconfigured the API Pydantic models. We may want to consider a rework of the API endpoints we build in the servers module, because the current setup seems to have out-of-date examples and remains very close to the original implementation of the ClassifAI app. I left this out of this PR as it seemed out of scope for the ticket.

📜 Changes Introduced
✅ Checklist
`terraform fmt` & `terraform validate`)
🔍 How to Test
Standard environment setup with this branch of the repo installed.
I ran through each DEMO notebook, including the server deployment DEMO script and verified that all the notebook cells and endpoints ran correctly. I adjusted the notebooks for the new format dataclass objects.
Running these notebooks, or another test script, and seeing the `VectorStore.search()` method return a dataframe with the column 'doc_label' will show the external working of the new features, as will the new input object and return object for the reverse search method.
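A minimal sketch of that column check, kept independent of the library itself: the helper `missing_columns` is hypothetical, and the column lists below are made-up stand-ins for the column names of a search result (only 'doc_label' and the old 'doc_id' come from this PR).

```python
def missing_columns(columns, required=("doc_label",)):
    """Return the required column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in required if name not in present]


# Stand-ins for the column names of a result dataframe after and before the
# rename; the names other than 'doc_label' and 'doc_id' are assumptions.
new_style = ["query_id", "doc_label", "doc_text"]
old_style = ["query_id", "doc_id", "doc_text"]

new_missing = missing_columns(new_style)
old_missing = missing_columns(old_style)
```

In a test script, the column names of the returned dataframe would be passed in place of these lists, and an empty result would indicate the rename has taken effect.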